inverse q-learning
Deep Inverse Q-learning with Constraints
Kalweit, Gabriel, Huegle, Maria, Werling, Moritz, Boedecker, Joschka
Popular Maximum Entropy Inverse Reinforcement Learning approaches require the computation of expected state visitation frequencies for the optimal policy under an estimate of the reward function. This usually requires intermediate value estimation in the inner loop of the algorithm, slowing down convergence considerably. In this work, we introduce a novel class of algorithms that only needs to solve the MDP underlying the demonstrated behavior once to recover the expert policy. This is possible through a formulation that exploits a probabilistic behavior assumption for the demonstrations within the structure of Q-learning. We propose Inverse Action-value Iteration, which is able to fully recover an underlying reward of an external agent analytically in closed form. We further provide an accompanying class of sampling-based variants which do not depend on a model of the environment. We show how to extend this class of algorithms to continuous state-spaces via function approximation and how to estimate a corresponding action-value function, leading to a policy as close as possible to the policy of the external agent, while optionally satisfying a list of predefined hard constraints. We evaluate the resulting algorithms, called Inverse Action-value Iteration, Inverse Q-learning and Deep Inverse Q-learning, on the Objectworld benchmark, showing a speedup of up to several orders of magnitude compared to (Deep) Max-Entropy algorithms. We further apply Deep Constrained Inverse Q-learning to the task of learning autonomous lane changes in the open-source simulator SUMO, achieving competent driving after training on data corresponding to 30 minutes of demonstrations.
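The abstract's key device is the probabilistic behavior assumption: the expert is modeled as acting Boltzmann with respect to its action-values, so observed action probabilities expose action-value differences in closed form. The toy sketch below (an assumption-laden illustration, not the paper's full Inverse Action-value Iteration, which additionally solves a linear system per state to recover the reward) verifies that identity numerically:

```python
import numpy as np

# Boltzmann behavior assumption:
#   pi(a|s) = exp(Q(s,a)) / sum_b exp(Q(s,b))
# implies the closed-form identity
#   log pi(a|s) - log pi(b|s) = Q(s,a) - Q(s,b),
# i.e. the observed policy reveals Q up to a per-state constant.

rng = np.random.default_rng(0)
n_states, n_actions = 4, 3
Q = rng.normal(size=(n_states, n_actions))  # hypothetical expert Q-table

# Boltzmann policy of the (simulated) expert.
logits = Q - Q.max(axis=1, keepdims=True)          # for numerical stability
pi = np.exp(logits) / np.exp(logits).sum(axis=1, keepdims=True)

# Recover Q up to a per-state constant from the observed policy alone.
Q_recovered = np.log(pi)

# Action-value differences match exactly.
diff_true = Q - Q[:, :1]
diff_rec = Q_recovered - Q_recovered[:, :1]
print(np.allclose(diff_true, diff_rec))  # True
```

This is the structural reason the approach avoids the inner-loop value estimation of Max-Entropy IRL: the demonstrated action distribution already pins down the Q-differences, so the MDP only has to be solved once.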
Deep Inverse Q-learning with Constraints (Appendix)
Kalweit, Gabriel
Visualizations of the real and learned state-values of IAVI, IQL and DIQL for different numbers of trajectories in Objectworld can be found in Figure 7; the lower row shows the EVD. Table 2 compares online and offline estimation of state-action visitations for the Objectworld environment, given a data set with an action distribution equivalent to the true optimal Boltzmann distribution. The pseudocode of the tabular variant of Constrained Inverse Q-learning can be found in Algorithm 4 (see [4] for further details of Constrained Q-learning), and the pseudocode of Deep Constrained Inverse Q-learning in Algorithm 5.
Multi-intention Inverse Q-learning for Interpretable Behavior Representation
Zhu, Hao, De La Crompe, Brice, Kalweit, Gabriel, Schneider, Artur, Kalweit, Maria, Diester, Ilka, Boedecker, Joschka
In advancing the understanding of decision-making processes, Inverse Reinforcement Learning (IRL) has proven instrumental in reconstructing animals' multiple intentions amidst complex behaviors. Given the recent development of a continuous-time multi-intention IRL framework, there has been persistent inquiry into inferring discrete time-varying rewards with IRL. To tackle this challenge, we introduce Latent (Markov) Variable Inverse Q-learning (L(M)V-IQL), a novel class of IRL algorithms tailored to accommodate discrete intrinsic reward functions. Leveraging an Expectation-Maximization approach, we cluster observed expert trajectories into distinct intentions and independently solve the IRL problem for each. Demonstrating the efficacy of L(M)V-IQL through simulated experiments and its application to different real mouse behavior datasets, our approach surpasses current benchmarks in animal behavior prediction, producing interpretable reward functions. This advancement holds promise for neuroscience and cognitive science, contributing to a deeper understanding of decision-making and uncovering underlying brain mechanisms.
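The Expectation-Maximization loop described above can be sketched as follows. This is a deliberately simplified toy, not the paper's L(M)V-IQL: each latent intention is modeled here as a state-conditioned action distribution fit by weighted counts, whereas the actual M-step solves a full IRL problem per intention; all names and sizes below are illustrative.

```python
import numpy as np

rng = np.random.default_rng(1)
n_states, n_actions, K = 5, 3, 2

# Two hypothetical ground-truth "intentions" with different action preferences.
true_pi = rng.dirichlet(np.ones(n_actions) * 0.3, size=(K, n_states))

def sample_traj(k, length=30):
    """Sample a trajectory of (state, action) pairs under intention k."""
    s = rng.integers(n_states, size=length)
    a = np.array([rng.choice(n_actions, p=true_pi[k, si]) for si in s])
    return s, a

trajs = [sample_traj(k) for k in (0, 1) for _ in range(20)]

# EM on the mixture of per-intention policies.
pi = rng.dirichlet(np.ones(n_actions), size=(K, n_states))
for _ in range(50):
    # E-step: responsibility of each intention for each trajectory,
    # from the trajectory log-likelihood under each intention's policy.
    loglik = np.array([[np.log(pi[k, s, a]).sum() for (s, a) in trajs]
                       for k in range(K)])
    loglik -= loglik.max(axis=0, keepdims=True)      # stabilize softmax
    resp = np.exp(loglik)
    resp /= resp.sum(axis=0, keepdims=True)
    # M-step: refit each intention's policy from responsibility-weighted
    # state-action counts (a stand-in for the per-intention IRL solve).
    counts = np.full((K, n_states, n_actions), 1e-3)
    for i, (s, a) in enumerate(trajs):
        for k in range(K):
            np.add.at(counts[k], (s, a), resp[k, i])
    pi = counts / counts.sum(axis=2, keepdims=True)
```

After convergence the responsibilities cluster the trajectories into the latent intentions, which is the mechanism the abstract credits for producing separately interpretable reward functions.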